Loan default prediction is important because it helps lenders and financial institutions to assess the risk of granting loans to borrowers. By predicting the likelihood of a borrower defaulting on a loan, lenders can make informed decisions on whether to approve the loan, how much to lend, and at what interest rate. This helps to reduce the risk of financial losses due to default and can improve the overall stability of the financial system.
This data set was collected from Github repository. In the case of this data the default column: 1 means they paid off their loan and 0 is the opposite. There are about 39,685 data points within this table.
selected <- default2 %>%
select(default, loan_amnt, annual_inc, income_loan_ratio)
datatable(selected, options=list(lengthMenu = c(3,10,30),scrollY=300,scroller=TRUE,scrollX=TRUE),
extensions="Scroller")
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
datatable(default, options=list(lengthMenu = c(3,10,30),scrollY=300,scroller=TRUE,scrollX=TRUE),
extensions="Scroller")
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
The probability of the loaning being paid back is based on the ratio of money loaned out to the debtors annual income is given by the following logistic regression model.
\[ P(Y_i = 1|x_i) = \frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0 + \beta_1 x_i}} = \pi_i \]
In this model, for each previous shuttle launch \(i\):
If \(\beta_1\) is zero in the above model, then \(x_i\) (ratio of money loaned) provides no insight about the probability of repayment. If one however, then the ratio plays an important role in the probability of repayment. Using a significance level of \(\alpha = 0.05\) we will test the below hypotheses about \(\beta_1\).
\[ H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \]
The estimates of the coefficients \(\beta_0\) and \(\beta_1\) for the above logistic regression model and data are shown below.
default.log <- glm(default ~ income_loan_ratio, data= default2, family=binomial)
summary(default.log) %>% pander()
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 2.284 | 0.0283 | 80.69 | 0 |
| income_loan_ratio | -2.441 | 0.1147 | -21.28 | 1.641e-100 |
(Dispersion parameter for binomial family taken to be 1 )
| Null deviance: | 32489 on 39684 degrees of freedom |
| Residual deviance: | 32053 on 39683 degrees of freedom |
This gives the estimated model for \(\pi_i\) as \[
P(Y_i = 1|x_i) \approx \frac{e^{2.284-2.441x_i}}{1+e^{2.284 -
2.441x_i}} = \hat{\pi}_i
\] where \(b_0 = 2.284\) is the
value of the (Intercept) which estimates \(\beta_0\) and \(b_1 = -2.441x_i\) is the value of
income_loan_ratio which estimates \(\beta_1\).
Importantly, the \(p\)-value for the
test of income_loan_ratio shows a significant result \((p = 1.641e-100)\) giving sufficient
evidence to conclude that \(\beta_1 \neq
0\). The loan income ratio effects the probability of the loan
being repaid.
palette(c("purple3", "grey22"))
plot(default ~ income_loan_ratio, data=default2, xlab="Loan Percent of Income ", ylab="Repayment Probability", main="Probability of Repayment", col= as.factor(default), pch=18)
b <- coef(default.log)
curve(exp(b[1]+b[2]*x)/(1+exp(b[1]+b[2]*x)), add = TRUE)
legend("topright", col=palette(), pch=18, legend=c("Default", "Repaid"), bty="n", text.col = palette())
Looking at the plot above you can see that the chance of the loan being paid back decreases as the percent of income increases.
To demonstrate that the logistic regression is a good fit to these
data we apply the Hosmer-Lemeshow goodness of fit test (since there are
only a couple repeated \(x\)-values)
from the library(ResourceSelection).
| Test statistic | df | P value |
|---|---|---|
| 15.36 | 8 | 0.05254 |
Since the null hypothesis is that the logistic regression is a good fit for the data, we claim that the logistic regression is appropriate (p-value = 0.05254).
For a hypothetical situation imagine my income is $10k as a college student and I want to buy a used car for about $5K. That means that my Loan Percent of Income would be about 0.50 the prediction for such would come out to:
predicted <- predict(default.log, newdata= data.frame(income_loan_ratio = 0.5), type="response")
predicted %>% pander()
| 1 |
|---|
| 0.7433 |
The probability of me paying back the loan is about \(74.33%\).